Independent De - Duplication in Data Cleaning #

نویسندگان

  • Ajumobi Udechukwu
  • Christie Ezeife
  • Ken Barker
  • A. Udechukwu
  • C. Ezeife
چکیده

Many organizations collect large amounts of data to support their business and decision-making processes. The data originate from a variety of sources that may have inherent data-quality problems. These problems become more pronounced when heterogeneous data sources are integrated (for example, in data warehouses). A major problem that arises from integrating different databases is the existence of duplicates. The challenge of de-duplication is identifying “equivalent” records within the database. Most published research in de-duplication propose techniques that rely heavily on domain knowledge. A few others propose solutions that are partially domain-independent. This paper identifies two levels of domain-independence in de-duplication namely: domainindependence at the attribute level, and domain-independence at the record level. The paper then proposes a positional algorithm that achieves domain-independent deduplication at the attribute level, and a technique for field weighting by data profiling, which, when used with the positional algorithm, achieves domain-independence at the record level. Experiments show that the proposed techniques achieve more accurate deduplication than the existing algorithms.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Data Position and Profiling in Domain-Independent Warehouse Cleaning

A major problem that arises from integrating different databases is the existence of duplicates. Data cleaning is the process for identifying two or more records within the database, which represent the same real world object (duplicates), so that a unique representation for each object is adopted. Existing data cleaning techniques rely heavily on full or partial domain knowledge. This paper pr...

متن کامل

De-Duplication Scheduling Strategy in Real-Time Data Warehouse

Data quality of the data warehouse is crucial to decision-makers. Data duplication is considered one of the critical factors that affect the data quality. Therefore, data de-duplication is an essential process for data warehousing. Particularly, for a real-time data warehouse, it is necessary to ensure not only the data quality in real-time, but also the performance of the front-end queries and...

متن کامل

Double Cervix with Normal Uterus and Vagina - An Unclassified Müllerian Anomaly

Müllerian anomalies are very common, and a frequent cause of infertility. The most used classification system until now, proposed by the American Society for Reproductive Medicine in 1988, categorizes comprehensively uterine anomalies but fails to classify defects of the cervix or vagina. This is based on a developmental theory that postulates that müllerian duct fusion is unidirectional, begin...

متن کامل

Building a 70 billion word corpus of English from ClueWeb

This work describes the process of creation of a 70 billion word text corpus of English. We used an existing language resource, namely the ClueWeb09 dataset, as source for the corpus data. Processing such a vast amount of data presented several challenges, mainly associated with pre-processing (boilerplate cleaning, text de-duplication) and post-processing (indexing for efficient corpus queryin...

متن کامل

Multi-Relational Record Linkage

Data cleaning and integration is typically the most expensive step in the KDD process. A key part, known as record linkage or de-duplication, is identifying which records in a database refer to the same entities. This problem is traditionally solved separately for each candidate record pair (followed by transitive closure). We propose to use instead a multi-relational approach, performing simul...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012